Poor Man's Stemming: Unsupervised Recognition of Same-Stem Words

Author

Harald Hammarström

Abstract

We present a new, fully unsupervised stemming algorithm that requires no human intervention and applies to an open class of languages. Because it relies on no linguistic resources other than raw text, and in particular on no existing large data collections, it is especially attractive for low-density languages. The stemming problem is formulated as a decision about whether two given words are variants of the same stem; if they are, a concatenative relation must hold between the two. The underlying theory makes no assumptions about whether the language uses a lot of morphology, whether it is prefixing or suffixing, or whether its affixes are long or short. It does, however, assume that (1) salient affixes are frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic for what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.
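To make the pairwise formulation concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it covers only the suffixing case, and the function names (`suffix_counts`, `same_stem`) and the thresholds (`max_len`, `min_count`, `min_stem`) are hypothetical choices made for illustration. It keeps only two ideas from the abstract: the two words must be related by concatenation to a shared string, and the leftover affixes must be frequent in the raw word list.

```python
from collections import Counter


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_counts(words, max_len: int = 3) -> Counter:
    """Count, over distinct words, how often each short word ending occurs.

    max_len is an illustrative cap on suffix length (an assumption of this
    sketch, not a value taken from the paper).
    """
    counts = Counter()
    for w in set(words):
        # Leave at least one character for the stem.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return counts


def same_stem(a: str, b: str, counts: Counter,
              min_count: int = 2, min_stem: int = 3) -> bool:
    """Decide whether a and b look like variants of one stem (suffix case only).

    The words must share a prefix of at least min_stem characters, and each
    leftover ending must be empty or occur at least min_count times in the
    word list. Both thresholds are hypothetical.
    """
    n = common_prefix_len(a, b)
    if n < min_stem:
        return False
    return all(r == "" or counts[r] >= min_count for r in (a[n:], b[n:]))
```

With a toy word list such as `["bake", "baked", "bakes", "walk", "walked", "walks", "table"]`, calling `same_stem("walked", "walks", suffix_counts(words))` accepts the pair because both leftover endings recur across the list, while `same_stem("bake", "table", ...)` is rejected for lack of a shared prefix. A fuller treatment would also handle prefixes and replace the fixed thresholds with a statistical criterion.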


Similar resources

Statistical Stemming for Kannada

Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming using simple truncation, clustering and an unsupervised morpheme segmentation algorit...


An Unsupervised Method to Improve Spanish Stemmer

In several Natural Language Processing tasks it is necessary to automatically extract the lemma or the stem of a word. However, the morphological variation of words makes this a hard task. That is why many algorithms for this purpose use external resources to resolve the problem. This is not a bad practice, but it makes the task language-dependent and has technical disadvantages. Perhaps this ...


HPS: High precision stemmer

Research into unsupervised ways of stemming has resulted, in the past few years, in the development of methods that are reliable and perform well. Our approach further shifts the boundaries of the state of the art by providing more accurate stemming results. The idea of the approach consists in building a stemmer in two stages. In the first stage, a stemming algorithm based upon clustering, whi...


Detecting Inflection Patterns in Natural Language by Minimization of Morphological Model

One of the most important steps in text processing and information retrieval is stemming: reducing words to stems that express their base meaning, e.g., bake, baked, bakes, baking → bak-. We suggest an unsupervised method for recognizing such inflection patterns automatically, with no a priori information about the given language, based exclusively on a list of words extracted from a large text. F...


A Study of the Effects of Stemming on Information Retrieval in the Persian Language

Exploiting language-specific behavior in information retrieval systems can significantly improve the quality of the retrieved results. The part of a word that remains after removing its affixes is called the stem. Stemming can be used to improve the relevance of the results in an information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...




Publication date: 2006
